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Abstract 



Approximate Bayesian inference on the basis of summary statistics is well- 
suited to complex problems for which the likelihood is either mathematically 
or computationally intractable. However the methods that use rejection suf- 
fer from the curse of dimensionality when the number of summary statistics 
is increased. Here we propose a machine-learning approach to the estimation 
of the posterior density by introducing two innovations. The new method 
fits a nonlinear conditional heteroscedastic regression of the parameter on 
the summary statistics, and then adaptively improves estimation using im- 
portance sampling. The new algorithm is compared to the state-of-the-art 
approximate Bayesian methods, and achieves considerable reduction of the 
computational burden in two examples of inference in statistical genetics and 
in a queueing model. 

Keywords : Approximate Bayesian computation | Conditional density es- 
timation | Implicit statistical models | Importance sampling | non-linear re- 
gression | indirect inference 
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1 Introduction 



Making use of simulations to perform non-Bayesian inference in models for 
which the likelihood is neither analytically solvable nor computationally 
tractable has a well-established methodology that finds its roots at least 
in the seminal papers of Diggle and Gratton (1984) and Gourieroux et al. 
(1993). This approach bypasses explicit likelihood functions by simulating 
from an implicit statistical model - that is, a model defined in term of a 
stochastic generating mechanism. 

In the Bayesian setting, there has been a growing interest in implicit 
statistical models for demographic inference in population genetics (Marjo- 
ram and Tavare 2006). Statistical inference with population-genetic data 
usually requires prior knowledge on genealogical trees. As the tree is usu- 
ally considered as a nuisance parameter, Bayesian Monte Carlo is a natu- 
ral approach to average over high-dimensional tree space. Although many 
likelihood-based methods have been proposed in this framework, these meth- 
ods are restricted to particular demographic and genetic processes (Stephens 
and Donnelly 2000; Wilson et al. 2003; Kuhner 2006; Hey and Nielsen 2007). 
Alternatively, likelihood-free methods, named approximate Bayesian compu- 
tation (ABC) after Beaumont et al. (2002), have gained many advocates 
in the recent years. The principle of standard ABC (Tavare et al. 1997; 
Pritchard et al. 1999; Beaumont et al. 2002) relies on the simulation of large 
numbers of data sets using parameters drawn from the prior distribution. A 
set of summary statistics is then calculated for each simulated sample, and 
compared with the values for the observed sample. Parameters that have 
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generated summary statistics close enough to the observed data are retained 
to form an approximate sample from the posterior distribution. 

Approximate Bayesian estimation algorithms - that were originally all 
based on rejection algorithms - can be classified into three broad categories, 
resembling the mainstream methods that are applied in standard computa- 
tional Bayesian statistics (German et al. 2003). The first class of methods 
relies on the direct rejection algorithm as described in the previous paragraph 
(Tavare et al. 1997; Pritchard et al. 1999). The second class of algorithms 
mimics Markov chain Monte Carlo methods (MCMC, Robert and Casella 
2004), embedding simulations from the implicit model in the updating step 
of the stochastic algorithm (Marjoram et al. 2003; Bortot et al. 2007). 
The MCMC-ABC algorithm takes into account the distance between the 
simulated and the observed summary statistics into the Metropolis-Hastings 
acceptance rule. The third class of algorithms shares similarity with the re- 
cently introduced sequential Monte Carlo samplers (SMC, Liu 2001). The 
main SMC- ABC algorithm combines ideas underlying rejection methods and 
sequential importance sampling (Sisson et al. 2007; Beaumont et al. 2008; 
Toni et al. 2009). 

Nevertheless, a severe limitation of rejection-based generative algorithms 
arises when the dimensionality of the set of summary statistics increases. 
Because the three classes of methods attempt to sample from a small mul- 
tidimensional sphere around the observed summary statistics, all of them 
suffer from the curse of dimensionality (see e.g. Hardle et al. 2004). To 
overcome this problem, Beaumont et al. (2002) allowed larger acceptance 
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rates in the rejection algorithm, ranging up to 20 percent of the simulated 
values, and then performed local linear adjustment in order to correct for 
the discrepancy between the simulated and the observed statistics. Here, 
we address the curse of dimensionality issue by adopting a machine learning 
perspective constructing a functional relationship between the generated set 
of summary statistics and the model parameters. Assuming perfect construc- 
tion, this relationship could ideally be utilized to produce samples from the 
posterior distribution by exploiting information not restricted to a subset of 
generated values, but to the extended set. 

In the first stage, our approach infers the functional relationship linking 
the summary statistics to the model parameters by considering a flexible non- 
linear conditional heteroscedastic (NCH) model. Flexible regression models, 
like neural networks, are exploited to reduce dimension and to better account 
for the correlation within the set of summary statistics. In the second stage, 
we introduce an adaptive version of the NCH (ANCH) algorithm using im- 
portance sampling. The rationale of the adaptive algorithm is to iteratively 
limit the discrepancy between the sampling distribution and the posterior 
distribution, which may be particularly useful when the prior distribution 
is vague. In two historical examples of demographic inference in population 
genetics and in an example of a queueing process, we provide evidence that 
the NCH and the ANCH algorithms reduce the computational burden when 
compared to ABC with rejection and to ABC with local linear adjustment. 
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2 Method 



Rejection and weighting. In ABC, we assume that there is a multidi- 
mensional parameter of interest </>, and the observed value s of a set of sum- 
mary statistics S is calculated for the data. To make statistical inference, a 
rejection-sampling method generates random draws Sj) where (pi is sam- 
pled from the prior distribution, and Sj is measured from synthetic data, 
simulated from a generative model with parameter (pi. Setting the tolerance 
error to a value 5 and denoting by |.| the Euclidean norm, only parameters 
4>i such that ||s — Sj|| < 5 are retained. Because the summary statistics may 
span different scales, norms that use rescaled distances are often considered 
in place of the Euclidean distance. In our application of ABC, we rescale dis- 
tances by the median absolute deviation of the simulated summary statistics 
Si, i = 1, . . . , M. For this basic rejection algorithm, the accepted (pi form a 
random sample from the approximate posterior distribution defined as 

p* (^|s) oc Pr(||s - Si || <5\4>)p{4>) 

where p((p) denotes the prior distribution. Compared to the exact expression 
of the posterior distribution, the likelihood is replaced by 

p(sl0) w Pr(l|s - ^11 <5\(f>). 

If the summary statistics are sufficient with respect to the parameter 0, 
the approximate posterior distribution converges to the true posterior dis- 
tribution as 5 goes to 0. In addition, the approximate posterior distribution 
corresponds to the prior distribution when 5 is large. Note that more generic 
interpretations of the ABC algorithm can be found in Wilkinson (2008). 
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Beaumont et al. (2002) introduced a first improvement of the stan- 
dard rejection ABC algorithm in which the parameters (pi were weighted 
by the values iiQ(||sj — s||), where K$ is the Epanechnikov kernel. Using 
this weighting scheme, an estimator of the posterior mean was then given 
b y Y^i K s{\\si ~ s\\)(f>i/J2i K s{\\si - s||). Although this was not originally- 
stated, it can be seen that it corresponds to the Nadaraya- Watson estima- 
tor, a classic approach to nonparametric regression (Nadaraya, 1964; Watson 
1964). Note that the Nadaraya- Wat son smoother is known to be subject to 
the curse of dimensionality since the rate of convergence of the estimator 
decreases dramatically as the dimension of the summary statistics increases 
(see e.g. Hardle et al. 2004). 

A local linear model. To avoid the curse of dimensionality, Beaumont 
et al. (2002) also described the posterior density as a homoscedastic linear 
regression model (in fact, a local-linear model) of the form 

& = a + ( Si - s) T /3 + Ci, z = l,...,M, (1) 

where a is an intercept, (3 is a vector of regression coefficients, and the Q's 
are independent random variates with mean zero and common variance. We 
further refer to this algorithm as the LocL ABC model. In the LocL model, 
the parameters (a, (3) are inferred by minimizing the weighted least-squares 
criterion 

n 

- (a + (a* - s obs Yf3)} 2 K s (\\ Sl - s||). (2) 

i=i 
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If equation ([T]) exactly describes the relationship between and s, random 
draws of the posterior distribution can simply be obtained as a + Q, for i = 
1, . . . , M. Using the empirical residuals in place of the ^'s, the parameters 
are adjusted as 

# = fa - - s) T /3. (3) 

Weighted by J^(||sj — s||), the 0*'s, i = 1, . . . , M, provide an approximate 
sample from the posterior distribution. In this approach, the choice of 5 
involves a bias-variance trade-off: Increasing S reduces variance thanks to a 
larger sample size for fitting the regression, but also increases bias arising 
from departures from linearity and homoscedasticity. 

A nonlinear conditional heteroscedastic model. In this study, we in- 
troduce an important modification on the previously described adjustment- 
based ABC method for conditional density estimation. In order to minimize 
departures from linearity and homoscedasticity, we propose to model both 
the location and the scale of the response parameter, fa, in equation ([IT) 
(see e.g. Fan and Yao 1998). The new regression model takes the form of a 
nonlinear conditional heteroscedastic model 

fa = m(8i) + cr(Si) x Ci, i = l,...,M, (4) 

where m(sj) denotes the conditional expectation, E[0|S = Sj], and <r 2 (sj) 
denotes the conditional variance, Var[0|S = Sj]. 

The conditional expectation can be estimated as rh(s,) by fitting a flexible 
non-linear regression model. The variance term is then estimated using a 
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second regression model for the log-residuals 



log(0j - m(si)) 2 = log<7 2 (si) + £ 



i = 



1,...,M, 



(5) 



where the £j's are independent random variates with mean zero and com- 
mon variance. In our forthcoming examples, we consider feed-forward neural 
network (FFNN) regression models (Ripley 1996; Bishop 2006). This choice 
is motivated by the fact that FFNNs include the possibility to reduce the 
dimensionality of the set of summary statistics via internal projections on 
lower dimensional subspaces (Bishop, 2006). 

In FFNN regression models, linear combinations of the inputs — i.e. the 
summary statistics — are transformed to compute the values Zj, j = 1, . . . , H 
at the H hidden units 



where s k , k = 1, . . . , D, is the k th component of s, H is the number of hidden 
units in the network, D is the dimension of the vector s of summary statistics, 
the u>jk 's, k = 0, . . . , D, j = 1, . . . , H are the weights of the first layer of the 
neural network, and h denotes the logistic function. Because the number of 
hidden units H is typically smaller that the number of summary statistics D, 
equation ([6]) reduces the initial dimension of the vector of summary statistics. 
The values Zi, i — 1, . . . , H , of the hidden units are then linearly combined 
to produce the output g w (s) of the FFNN 



D 





k=l 
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(7) 
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where the Wj s, j = 0, . . . , H, are the weights of the second layer of the 
neural network. Note that logistic regression can easily be performed within 
the FFNN framework by transforming the linear combination of equation 
([7j) with a logistic function. The extension to general discrete parameters is 
obtained using a softmax transformation (see Bishop 2006). 

We use FFNNs for fitting both m(sj) and logcr 2 (sj) (Nix and Weigend 
1995). The weights w of a first FFNN are found by minimizing the regular- 
ized least-squares criterion 

n 

^{0,-( 7w (s)} 2 A' 5 (||s l -s||) + A||w|| 2 , 

i=l 

where A represents the regularization parameter, called weight-decay param- 
eter in the neural network literature. The weights of a second FFNN esti- 
mating the conditional variance, are found by minimizing 

n 

X>g((<fc - m( Si )) 2 ) - ^(s)} 2 /i,(|| Sl - s||) + A||w'|| 2 . 

i=l 

Similarly to equation (j3J), parameter adjustment under the NCH model 
can be performed as follows 

# = m(s) + (fa - m(s<)) x ?f\, i = l,...,M. (8) 

Assuming that fa = m(sj) + a(si)Q corresponds to the true relationship 
between <pi and s i; then (0*) forms a random sample from the distribution 
p(0|s) provided that rh could be considered equal to m and a equal to cr. 

Similarly to the LocL ABC method, a tolerance error, 5, is allowed, and 
the adjusted parameters, 0*, are weighted by As(||sj — s||). Furthermore, 
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to warrant that the adjusted parameters, 0*, obtained from equation Q 
or (jSJ), fall in the support of the prior distribution, we sometimes consider 
transformations of the original responses. Parameters that lie in an inter- 
val are transformed via the logit function, and nonnegative parameters are 
transformed using a logarithm. These transformations have the further po- 
tential advantage of stabilizing the variance of the response when performing 
regression (Box and Cox 1964). 

Iterated importance sampling. A second change to the LocL ABC al- 
gorithm converts the single-stage regression based ABC method into a multi- 
stage algorithm in which estimations are improved iteratively (Liu 2001; Sis- 
son et al. 2007). In practice, we implemented a two-stage algorithm. The 
logic of using a two-stage algorithm is that the second run can control a first 
run with a high acceptance rate, and adaptively builds a better approxima- 
tion of the posterior distribution. If the two empirical distributions obtained 
after each step of the ANCH algorithm agree, then the results can be pooled 
to form a larger approximate sample from the posterior distribution. 

The adaptive NCH algorithm can be described as follows: Starting from 
a sample (<fi\) obtained from a first NCH ABC run, the adaptive step of the 
algorithm consists of estimating the support Aj of the sample. Then new 
parameters are proposed from the conditional prior distribution pAj given 
that they fall in A^ This can be implemented using a simple rejection step. 
For <fi having a moderate number of dimension, this is usually achieved at a 
computational cost which can be considered significantly lower than the cost 
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of simulating from the generating distribution, p(s|0). Using this new set of 
parameters, a second sample, (</>f), can be formed using the NCH method 
again. To compensate for the fact that we do not sample from the prior 
distribution, each value 4>f should in principle be weighted by P^DIp^A^D- 
Here, the importance weights need not to be computed because we have 
PAi(0f) = K0!)/p(Ai)j which means that the ratio p((pf)/pA 1 (4>'i) does not 
depend on i. In this two-stages approach, we suppose that Ai approximates 
the support of p(s\<p) accurately. For multi-dimensional parameters 0, we 
estimate the support of the distribution using support vector machines (SVM, 
SchSlkopf et al. 2001). 

3 Examples of implicit statistical models 

In this section, we present three examples of implicit statistical models, two 
of which have received considerable attention in population genetics, and 
the last one has served to illustrate indirect inference (Gourieroux et al. 
1993; Heggland and Frigessi 2004). Using these examples, we performed an 
empirical evaluation of the relative performance of three regression-based ap- 
proximate Bayesian algorithms, the local linear regression model (LocL ABC 
model), the non-linear conditional heteroscedastic model (NCH model), and 
its adaptive implementation (ANCH model). We used the R programming 
language to implement the LocL, NCH and ANCH algorithms. Least squares 
adjustment for neural networks was implemented using the R package nnet 
(R Core Team 2008). Model choice for neural networks was based on a 
Bayesian (or regularization) approach (Ripley 1996). We used H = 4 hid- 
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den units and the weight-decay regularization parameter was set equal to 
A = 0.001. Increasing the weight-decay parameter will increase the bias of 
the estimator, but it will also decrease its variance. At this stage, cross- 
validation might be a useful alternative approach to the choice of a FFNN 
model, but the previous values for A and H proved to work well in the exam- 
ples considered here. In the ANCH algorithm, the support of the conditional 
density was estimated using a SVM ^-regression algorithm [y = 0.005) as 
implemented in the R package el071 based on the public library libsvm 
(Chang and Lin 2001). 

Two examples in population genetics. There has been tremendous in- 
terest in simulation-based inference methods in evolutionary biology during 
the last decade (Fu and Li 1997; Pritchard et al. 1999; Fagundes et al. 
2007). In these applications, the inference of demographic and genetic pa- 
rameters depends on the so-called coalescent approximation which describes, 
in a probabilistic fashion, the ancestry of genes represented in a sample. Coa- 
lescent models provide good examples of implicit statistical models for which 
a straightforward stochastic generating mechanism exists, but the likelihood 
is usually computationally intractable (see the Supplementary Material for 
further information). 

Example 1. Given a set of n DNA sequences, the first problem concerns 
the estimation of the effective mutation rate, 9 > 0, under the infinitely- 
many-sites model. In this model, mutations occur at rate 9 at DNA sites 
that have not been hit by mutation before. If a site is affected by mutation, 
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it is said to be segregating in the sample. In this example, the summary 
statistic, s, is computed as the number of segregating sites. Note that s is 
not a sufficient statistic with respect to 9 (Fu and Li 1993). The generating 
mechanism for s can be described as follows. 

1. Simulate L n , the length of the genealogical tree of the n sequences, as 
the sum of independent exponential random variables of rate (j — l)/2, 
j = 2, . . . , n. 

2. Generate s according to a Poisson distribution of mean 6L n /2. 

The simulation of L n can be derived from the formula L n = J^™ =2 jX? that 
gives the total length of the tree as a function of the inter-coalescence times 
denoted by Yj, j = 2, . . . , n. The inter-coalescence times correspond to the 
times during which the sample has j ancestors, j = 2,...,n (Tavare et 
al. 1997). In a wide range of models in population genetics, the inter- 
coalescence times form independent random variables distributed according 
to the exponential distributions of rate j(j — l)/2, j = 2, . . . ,n. A more 
detailed description of the coalescent process can be found in (Tavare 2004). 

We computed the posterior distribution of the effective mutation rate 9 
given the observation of s = 10 segregating sites in a sample of n = 100 DNA 
sequences. The prior distribution for the parameter 9 was taken to be the 
exponential distribution of mean 50, which was meant to represent a vague 
prior. An sample from the posterior distribution was obtained using a direct 
rejection algorithm accepting only parameters that produced 10 segregating 
sites exactly. Ten millions of replicates were generated resulting in a sample 



14 



of size 39, 059 after rejection. 

For the ABC algorithms, we performed inference of the posterior distri- 
bution using a total of 2,000 simulations of the bivariate vector (9,s). We 
recorded the three quartiles and the 0.025 and 0.975 quantiles of the ap- 
proximate posterior distributions computed by the three algorithms, and we 
compared these 5 quantiles Qk, k=l,. . . ,5, with the corresponding empirical 
quantiles, Q° k , obtained from the exact sample. For values of the tolerance 
rate between and 1 and for each quantile, Qk, the accuracy of each al- 
gorithm was assessed by the relative median absolute error defined as the 
median of (Qk — Qk)/Qt computed over 150 runs. In addition, we measured 
the discrepancy between each approximate distribution and the empirical 
posterior distribution using the sum of the RMAE's over all quantiles. In 
the ANCH algorithm, the support was estimated as the range of the empir- 
ical distribution - i.e., the (0,max) interval - obtained after a first run. A 
total of 1,000 replicates were used in the first run and the tolerance rate, 
denotes as P$, was set to 75%. 

Comparisons with standard rejection algorithms were first conducted. 
We found that the posterior distribution obtained from the rejection meth- 
ods deviated from the empirical posterior distribution significantly for tol- 
erance rates larger than 10% (Figure [T]A.). The LocL model approximated 
the posterior distribution accurately for small tolerance rates (< 20%), but 
the performances of the LocL method deteriorated as the tolerance rate in- 
creased (Figure OQ curves with diamonds). The performances of the NCH 
model were significantly less sensitive to the tolerance rate, staying at values 



15 



close to the optimum achieved by the LocL model (Figure [TJ3, curve with 
triangles point-up). The adaptive NCH algorithm achieved even superior 
performances for values of the tolerance rate ranging between and 90% 
(Figure [T]B, curve with triangles point-down). The black dot in Figure [TB 
represents the performance of the ANCH algorithm without weighting and 
allowing total acceptance. Having eliminated the concept of rejection in the 
approximate Bayesian algorithm, i.e. setting Pg — 1, the accuracy of the 
algorithm remained close to the optimum achieved by all algorithms. This 
first example illustrates the benefit of the NCH model over the LocL model 
and the other rejection methods. The additional gain of the adaptive step 
stems from the use of a vague prior, which, in this case, gave low weight 
to the region of posterior values. We note, however, that for the smallest 
tolerance rate, the RMAE is smaller in the LocL model than in the NCH 
models. Because of the additional level of complexity introduced in FFNNs, 
FFNNs require more data points to be trained than local linear regression, 
but this not an issue for higher tolerance rates. 

Example 2. Turning to a more complex problem in which the posterior 
distribution could not be estimated easily, we considered an exponentially 
growing population model with 3 parameters. Similarly to Weiss and von 
Haeseler (1998), a population of size Na started to grow exponentially to 
years ago to reach a present size of N individuals where N = Na/ch for a 
value a G (0, 1). We performed inference on the two parameters Na, to and a 
was considered as a nuisance parameter. In this example the data consisted 



16 



of a sample of n individuals genotyped at a multilocus subset of independent 
microsatellite markers (see e.g. Zhivotovsky et al. 2003). Microsatellite 
loci are characterized by a motif of two to four nucleotides that may repeat 
itself several times, and the data are recorded as number of repeats for each 
individual. 

The generating mechanism for the implicit model can be described as 
follows. 

1. Simulate candidate coalescent genealogies with n tips in a growing pop- 
ulation for each marker, 

2. Superimpose mutations on the tree branches according to a specific 
mutation model. 

Step 1 requires simulating coalescence times in a coalescent model with vary- 
ing population size (see Tavare 2004). In step 2, we used the single-step 
mutation model, that can be viewed as a simple random walk for which the 
+1 and -1 steps are equally likely (Ohta and Kimura 1973). 

To capture the pattern of genetic variation, we computed six summary 
statistics previously reported to be sensitive to the genetic diversity in the 
sample and to the intensity of the demographic expansion. The amount of 
genetic diversity was measured by the mean (over the loci) of the variance in 
the number of repeats and by the mean of their heterozygosities (Pritchard 
and Feldmann 1996). For the demographic pattern, we used two imbalance 
indices studied by King et al. (2000), the interlocus statistic introduced 
by Reich and Goldstein (1998), and the expansion index of Zhivotovsky et 
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al. (2000). We also computed a seventh summary statistic based on an 
observation of Shriver et al. (1997) who studied the distribution, Sk, of 
pairwise comparisons that differ by K repeat units. This distribution has its 
peak at the value for a recent expansion, and the peak shifts to the value 
1 for older expansions. To compute the seventh statistic, we averaged the 
quantity Si — S over all the loci. 

We took a uniform prior distribution ranging from to 100,000 years 
for the onset of the expansion, a uniform distribution over the interval (0, 
10,000) for the ancestral population size, and a uniform distribution over the 
interval (1,6) for — log 10 (a). 

One hundred test data sets were generated using t = 18, 000 years for 
the date of onset of the expansion, Na = 1, 500 for the ancestral population 
size, and a = 0.0012 (log 10 (ct) = —2.92) for the ratio of the ancestral size 
to the present population size. These values were very similar to those used 
in Pritchard et al. (1999) in a study of the Y chromosome in humans. For 
each algorithm, we computed the posterior distribution of the 3 parameters 
given the observation of the 7 summary statistics in a sample of n = 100 
individuals surveyed at 50 microsatellite loci. 

For the three algorithms, we generated samples from the posterior dis- 
tribution using 2,000 replicates from the implicit model. We recorded the 
quartiles and the 0.025 and 0.975 quantiles of the output distributions for 
the three ABC algorithms, and we compared these 5 values for the condi- 
tional distributions of each parameter t 0; Na, and a. The median value of 
each quantile was then computed over 100 runs. 
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For each of the three parameters, the median estimates of the quantiles 
of the marginal posterior distribution were very similar in the NCH model 
and in the ANCH implementation (Figured black and red curves). For large 
tolerance rates (P$ > 50%), we observed a strong agreement with the values 
estimated by the LocL model (Figure [2} green curves) used with small toler- 
ance rates (P$ = 5%), indicating that the NCH model can efficiently exploit 
simulation results that fall far apart from the observed values of the summary 
statistics. The performances of the LocL model decreased as the tolerance 
rate increased above 20%, and the estimation of the conditional distribution 
of the ancestral size provided evidence that the bias was substantial (Figure 
El bottom right panel). The three algorithms gave similar results regarding 
the estimation of a (results not shown). 

To further compare the performances of the NCH model and its ANCH 
variant, we studied a particular simulated data set corresponding to the same 
ground truth as before. After running the three algorithms 100 times for each 
tolerance rate, the variance of each quantile in the posterior distribution was 
of order three times higher in the NCH model than in the ANCH algorithm 
(2,000 simulations for each algorithm, Table 1). Clear evidence of the stabi- 
lization phenomenon was also obtained under the infinitely-many-sites model. 
Given s = 10 segregating sites, we ran the ANCH algorithm using 200 sim- 
ulations of the implicit model at each step (tolerance rate P$ = 85%). To 
compare estimations obtained after the initial step with those obtained after 
adapting the support, we replicated the estimation procedure 100 times. The 
reduction in variance ranged from a factor 2.7 to a factor 34.7. The highest 
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reduction in variance was obtained for the estimation of the upper quantile. 

Example 3. Our third example arose from a totally different context, and 
was formerly studied by Heggland and Frigessi (2004) using indirect infer- 
ence. The connection of indirect inference to ABC is the following. Indirect 
inference (Gourieroux et al. 1993) is a non-Bayesian method that proceeds 
with 3 steps. 1) an auxiliary model is introduced, usually as a simplified 
version of the true model. 2) Estimates of the parameters in the auxiliary 
model are obtained and play the role of summary statistics. These estimates 
can be obtained, for example, by maximizing the likelihood in the auxiliary 
model. 3) An estimate of the parameter <p is built by minimizing a weighted 
Euclidean distance between simulated summary statistics and the observed 
summary statistics. Note that the introduction of an auxiliary model can 
also be a useful means of finding informative summary statistics for ABC 
methods. 

The model considered in (Heggland and Frigessi 2004) was a queueing 
system with a first-come-first-serve single-server queue (G/G/l). The service 
times were uniformly distributed in the interval [6i, 62}, and the inter-arrival 
times had exponential distribution with rate 6*3. Let W n be the inter-arrival 
time of the nth customer and U n be the corresponding service time. The 
process of inter-departure times {Y n , n = 1,2,...} can be described by the 
following generative algorithm 
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Bayesian inference on (0i,6 2 ,6 3 ) was done by assuming that only the inter- 
departure times were observed. Because the inter-arrival times were unob- 
served, likelihood-based inference would involve high-dimensional integra- 
tion. 

We generated a test data set with n = 50 successive inter-departure time 
observation using Q\ — 1, 62 — 5, and 6* 3 = 0.2. We set a uniform prior over 
[0, 10] for $i, 2 — 61, and for 6*3. To investigate the sensitivity of the NCH 
model to the number of summary statistics, we ran the ABC-NCH algorithm 
using 5, 10 and 20 summary statistics. Here, the set of summary statistics 
included the minimum and the maximum of the inter-departure times and the 
3, 8 and 18 equidistant quantiles of the inter-departure times. We used 10,000 
replicates and the tolerance rate was set to the value P$ = 50% resulting in 
a posterior sample of size 5,000. 

Figure [3] shows that posterior distributions of the 3 parameters had their 
mode and median values close to the ground truth values whatever the num- 
ber of summary statistics. This provided evidence that the NCH ABC algo- 
rithm was robust to an increase in the dimensionality of the set of summary 
statistics. In addition, we found that the posterior distributions were more 
concentrated around the true values when 20 summary statistics were used. 

To investigate the variability of the ABC algorithms from one run to 
the other, we ran the LocL, NCH and ANCH algorithms 100 times on the 
same data set. We used 2,000 replicates (a rather small number) in order to 
observed an exaggerated variability, and we varied the tolerance rate from 
to 1. 
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Figure H] displays estimated posterior quantiles for 3 . The LocL ABC 
algorithm was the less variable algorithm, but the posterior credibility inter- 
vals produced by this method were wider than those produced by the NCH 
and ANCH algorithms. The latter were less sensitive to the variation of the 
tolerance rates, and the medians of the posterior distributions were closer to 
the true value in the non-linear models than in the LocL model. Compared 
to the NCH algorithm, the ANCH algorithm reduced the variance of the 
quantile estimates. We suspect that the variance of the posterior quantile es- 
timates came from the use of local optimization during the learning phase of 
the feed- forward neural networks. For 81 and 82, the LocL ABC and the NCH 
ABC algorithms led to similar approximate posterior distributions when the 
tolerance rate was set at values close to zero in the LocL ABC algorithm. 

4 Discussion 

Approximate Bayesian computation encompasses a wide range of useful meth- 
ods for making inference in implicit statistical models. In this context re- 
jection algorithms have greatly benefited from ideas coming from regression- 
based conditional density estimation. So far conditional density estimation 
in ABC approaches has relied on linear adjustment exclusively (Beaumont et 
al. 2002). While the linear regression-based ABC method can approximate 
posterior distribution accurately, this is usually achieved at the expense of 
a heavy computational load. For example, using the LocL ABC method for 
estimating parameter in complex models of modern human expansion, Fa- 
gundes et al. (2007) required an amount of computational time equivalent 



22 



to 10 CPU-months. To increase the tolerance of the algorithm, we have pro- 
posed to use non-linear regression-based ABC. In three examples, non-linear 
neural networks proved to be able to reduce computational generation costs 
significantly. 

A heuristic reason why neural networks worked well when the number 
of summary statistics was large is that their first layer allows a nonlinear 
projection onto a subspace of much lower dimensionality, and nonlinear re- 
gression can then be performed using the reduced number of projection vari- 
ables. Increasing the number of summary statistics has a dramatic effect 
on the variability of the estimators of the conditional mean m(sj) and vari- 
ance <3" 2 (sj) and consequently inflates the variances of the estimated posterior 
distributions. The variance can be reduced with the Bayesian predictive ap- 
proach of Ripley (1996) that consists of training a large number of FFNNs 
for each conditional regression and averaging the results over the replicate 
networks. In addition Bayesian neural network theory includes general rules 
for choosing appropriate regularization parameters which makes the method 
rather automatic. Compared to other regression models, neural networks 
share many properties of projection pursuit regression (Friedman and Stuelze 
1981), which may then lead to equivalent performances. As well SVM have 
gained increased popularity in machine learning approaches during the recent 
years (Vapnik 1998), and the algorithms described here could be modified to 
include SVM regression without change in spirit. 

A second justification for using feed-forward neural networks is their abil- 
ity to implement probabilistic outputs, hence allowing a unified Bayesian 



23 



treatment of model choice. Indeed model choice may be performed by consid- 
ering the model itself as an additional parameter to infer. Beaumont (2007) 
proposed to estimate the posterior probability of each candidate model by 
an approach based on a weighted multinomial logistic regression procedure. 
This approach is an extension of logistic regression to more than two cate- 
gories, and it is equivalent to the use of a multinomial log-linear model. As 
they pertain to a more flexible class of models, neural networks may achieve 
equal or better predictive values than multinomial logistic regression (Ripley 
1996). 

The ABC approach has been recently used in connected domains like 
human population genetics (Pritchard et al. 1999, Fagundes et al. 2007), 
epidemiology (Tanaka et al. 2006, Toni et al. 2009, Blum and Tran 2008) 
or for the evolution of protein networks (Ratmann et al. 2007). It has also 
recently been suggested in the context of compositional data (Butler and 
Glasbey 2008), and applied in extreme value theory (Bortot et al. 2007) and 
Gibbs random fields (Grelaud et al. 2008). Although inference from synthetic 
data that mimic observations has a long lasting record in frequentist statistics 
(Diggle and Gratton 1984, Gourieroux et al. 1993), ABC is still in its infancy. 
Because the ABC method combines the power of simulating from stochastic 
individual-based models with sound methodological grounds from Bayesian 
theory, it has the potential to open doors to inference in many complex 
models in ecology, evolution, and epidemiology, or other domains like the 
social science. Improved statistical ABC models, like those presented in this 
study, will then be useful to deal with increased model complexity, and with 
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the need to raise the dimension of the vector of summary statistics. An R 
code for performing ABC with the NCH model is available at the author's 
website. 
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Figure legends 



Figure 1. Relative median absolute error (RMAE) when estimating the 
quantiles of the posterior distribution of 9 as a function of the tolerance rate. 
In panel A), the errors obtained with the regression algorithms are compared 
to the errors obtained with the local linear adjustment. In panel B), the 
different regression methods are compared. For the quartiles and for the 0.025 
and 0.975 quantiles, relative errors between approximate quantiles computed 
by the ABC methods and empirical values from the posterior distribution 
were computed over the 150 replicates. The sum of the RMAE's was obtained 
by summing, over the 5 quantiles, the values of the relative median absolute 
differences. The black dot corresponds to the rejection-free ANCH algorithm 
(tolerance rate P$ = 100%) without including the Epanechnikov weights. 

Figure 2. The posterior quantiles of the date of onset of expansion and 
the ancestral population size for the NCH and the ANCH methods (left) and 
LocL ABC method (right). The quantiles are plotted against the tolerance 
rate for the NCH and the ANCH methods (left) and LocL ABC method 
(right). For each algorithm, the curves represent the 0.025, 0.25, 0.5, 0.75, 
and 0.975 quantiles of the posterior distribution. These values correspond 
to the median over the 100 test data sets. The ground truth values are 
represented as the blue lines. The values of the quantiles estimated by the 
NCH and ANCH methods using a tolerance rate P$ = 75% match with those 
obtained from the LocL algorithm using P$ = 5%. 
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Figure 3. The posterior quantiles of 9±, 62, and #3, using the NCH ABC 
method, with 5, 10, and 20 summary statistics. The vertical lines correspond 
to the true values of the parameters that were used when simulating the data 
set. The tolerance rate was set at 50% and a total of 10,000 simulations were 
performed. 

Figure 4. The boxplot of the posterior quantiles for #3 as a function 
of the tolerance rate. The different ABC methods were run 100 times each 
using 2,000 simulations at each run. 
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Table caption 



Table 1. The benefit of adapting the support. Infinitely-many-sites: 
Ratios of variances for the quantiles estimated after the first NCH step and 
the second NCH step of the ANCH algorithm (acceptance rate P$ = 85% in 
each step). Expansion model: Ratios of variances for the quantiles estimated 
by the NCH and the ANCH algorithm (2,000 simulations in each algorithm, 
Ps = 75%). P- values were computed according to the F-test. 
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